It is important for physicians to be able to predict a patient's risk for chronic diseases. Demographics and certain behaviors can predispose someone to particular disorders. This project aims to predict the presence of angina, stroke, and diabetes from a number of demographic and behavioral measures. Specifically, it uses machine-learning approaches both to predict which patients have these chronic diseases and to narrow down which specific features are predictive of them.
According to information aggregated by the CDC, in a given year, 659,000 people die from heart disease, 795,000 suffer a stroke, and more than 87,000 die from diabetes (Virani et al., 2021). Given the high prevalence of each of these conditions, it is paramount for clinicians to have models that can accurately predict whether a given patient is likely to develop any of them. The aim of this project is to determine if risk-related behaviors are indicative of these conditions. It is hypothesized that with this data, machine-learning-based classifiers will be able to accurately identify patients with chronic diseases (angina, stroke, and diabetes), and that feature selection methods (logistic regression, recursive feature elimination, random forests, etc.) will narrow down the list of important predictors for each condition.
This project uses a dataset of categorical variables covering demographics, health-related behaviors, and history of angina, stroke, and diabetes. Random forest classification, recursive feature elimination, and logistic regressions are used to determine which features are most important in predicting the chronic diseases. Additionally, naive Bayes, linear discriminant analysis, and decision trees are used to predict whether a given patient has each of the three conditions.
Let's load the data and preprocess it as needed.
library(plotly)
library(MASS)
library(nnet)
library(caret)
library(corrplot)
library(randomForest)
library(stats)
library(tm)
library(SnowballC)
library(biclust)
library(tidyverse)
library(psych)
library(mi)
library(car)
library(gmodels)
library(e1071)
library(C50)
library(Boruta)
library(Rcpp)
# Loading the data in, changing the data to factor variables, and getting some summary information
health <- read.csv("CaseStudy09_HealthBehaviorRisks_Data_V2.csv")
summary(health)
## ID AGE_G SEX RACEGR3
## Min. : 1.0 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 250.8 1st Qu.:3.00 1st Qu.:1.000 1st Qu.:1.000
## Median : 500.5 Median :5.00 Median :2.000 Median :1.000
## Mean : 500.5 Mean :4.34 Mean :1.574 Mean :1.345
## 3rd Qu.: 750.2 3rd Qu.:6.00 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :1000.0 Max. :6.00 Max. :2.000 Max. :9.000
## IMPEDUC IMPMRTL EMPLOY1 INCOMG
## Min. :2.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :5.000 Median :1.000 Median :7.000 Median :4.000
## Mean :4.889 Mean :2.216 Mean :5.541 Mean :4.337
## 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:8.000 3rd Qu.:5.000
## Max. :6.000 Max. :6.000 Max. :8.000 Max. :9.000
## CVDINFR4 CVDCRHD4 CVDSTRK3 DIABETE3
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :1.000
## Mean :1.996 Mean :1.813 Mean :1.905 Mean :1.487
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :4.000
## RFSMOK3 RFDRHV4 FRTLT1 VEGLT1
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.319 Mean :1.257 Mean :1.736 Mean :1.715
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
## TOTINDA
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.574
## 3rd Qu.:2.000
## Max. :9.000
colnames(health) <- c("ID","Age","Sex","Race","School","Marriage","Employment","Income","HeartAttack","Angina","Stroke","Diabetes",
"Smoking","Alcohol","Fruits","Vegetables","Leisure")
health <- subset(health, select = -ID)
health[] <- lapply(health, factor) # convert every column to a factor
summary(health$Sex)
## 1 2
## 426 574
summary(health$Angina)
## 1 2
## 187 813
summary(health$HeartAttack) # heart attack has too few cases to do an analysis
## 1 2
## 4 996
summary(health$Stroke)
## 1 2
## 95 905
summary(health$Diabetes)
## 1 2 3 4
## 745 48 182 25
Angina, heart attacks, strokes, and diabetes are the conditions we can predict from the risk factors in this dataset. However, only 4 people reported having had a heart attack, which is not enough data to make any reasonable predictions. Luckily, more people reported having the other conditions, so those conditions can be used for further analysis.
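This check can be done programmatically; a small sketch using the renamed `health` columns from the preprocessing step above:

```r
# Smallest class size for each candidate outcome variable; an outcome with
# only a handful of cases (HeartAttack: 4) cannot support a classifier
outcomes <- c("HeartAttack", "Angina", "Stroke", "Diabetes")
sapply(health[outcomes], function(x) min(table(x)))
```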
## Prediction Models (Bayes, LDA, Decision Trees)
healthSubset <- sample(nrow(health), floor(nrow(health) * .60)) # note: no seed is set, so the split (and downstream results) will vary between runs
healthTrain <- health[healthSubset, ]
healthTest <- health[-healthSubset, ]
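One caveat: `sample()` gives a purely random split, so with rare outcomes such as stroke (roughly 10% of cases) the train and test class proportions can drift apart. A sketch of a stratified alternative using caret's `createDataPartition` (stratifying on Stroke here; the seed value is arbitrary):

```r
set.seed(42) # arbitrary seed, only for reproducibility
strokeIdx <- createDataPartition(health$Stroke, p = 0.60, list = FALSE)
trainStrat <- health[strokeIdx, ]
testStrat  <- health[-strokeIdx, ]
# the stroke prevalence in each partition should now track the full data's
prop.table(table(trainStrat$Stroke))
prop.table(table(testStrat$Stroke))
```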
anginaBayesModel <- naiveBayes(healthTrain[,-9], healthTrain$Angina) # drop the Angina column itself so the target does not leak into the predictors
anginaPred <- predict(anginaBayesModel, healthTest)
anginaBayesCT <- CrossTable(anginaPred, healthTest$Angina)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Angina
## anginaPred | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 76 | 1 | 77 |
## | 257.435 | 60.386 | |
## | 0.987 | 0.013 | 0.192 |
## | 1.000 | 0.003 | |
## | 0.190 | 0.002 | |
## -------------|-----------|-----------|-----------|
## 2 | 0 | 323 | 323 |
## | 61.370 | 14.395 | |
## | 0.000 | 1.000 | 0.807 |
## | 0.000 | 0.997 | |
## | 0.000 | 0.807 | |
## -------------|-----------|-----------|-----------|
## Column Total | 76 | 324 | 400 |
## | 0.190 | 0.810 | |
## -------------|-----------|-----------|-----------|
##
##
plot_ly(x = c("TP", "FP", "FN", "TN"), # prop.row[1,1] = predicted 1 & actual 1 = true positive, since class 1 ("yes") is the positive class
y = c(anginaBayesCT$prop.row[1,1], anginaBayesCT$prop.row[1,2], anginaBayesCT$prop.row[2,1], anginaBayesCT$prop.row[2,2]),
name = c("TP", "FP", "FN", "TN"), type = "bar", color=c("TP", "FP", "FN", "TN")) %>%
layout(title="Confusion Matrix",
legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Row Proportion'))
strokeBayesModel <- naiveBayes(healthTrain[,-10], healthTrain$Stroke) # drop the Stroke column itself so the target does not leak into the predictors
strokePred <- predict(strokeBayesModel, healthTest)
strokeBayesCT <- CrossTable(strokePred, healthTest$Stroke)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Stroke
## strokePred | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 40 | 0 | 40 |
## | 314.344 | 35.900 | |
## | 1.000 | 0.000 | 0.100 |
## | 0.976 | 0.000 | |
## | 0.100 | 0.000 | |
## -------------|-----------|-----------|-----------|
## 2 | 1 | 359 | 360 |
## | 34.927 | 3.989 | |
## | 0.003 | 0.997 | 0.900 |
## | 0.024 | 1.000 | |
## | 0.002 | 0.897 | |
## -------------|-----------|-----------|-----------|
## Column Total | 41 | 359 | 400 |
## | 0.102 | 0.897 | |
## -------------|-----------|-----------|-----------|
##
##
plot_ly(x = c("TP", "FP", "FN", "TN"),
y = c(strokeBayesCT$prop.row[1,1], strokeBayesCT$prop.row[1,2], strokeBayesCT$prop.row[2,1], strokeBayesCT$prop.row[2,2]),
name = c("TP", "FP", "FN", "TN"), type = "bar", color=c("TP", "FP", "FN", "TN")) %>%
layout(title="Confusion Matrix",
legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Row Proportion'))
diabetesBayesModel <- naiveBayes(healthTrain[,-11], healthTrain$Diabetes) # drop the Diabetes column itself so the target does not leak into the predictors
diabetesPred <- predict(diabetesBayesModel, healthTest)
CrossTable(diabetesPred, healthTest$Diabetes)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Diabetes
## diabetesPred | 1 | 2 | 3 | 4 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 300 | 1 | 0 | 1 | 302 |
## | 23.851 | 16.928 | 47.565 | 7.170 | |
## | 0.993 | 0.003 | 0.000 | 0.003 | 0.755 |
## | 1.000 | 0.040 | 0.000 | 0.083 | |
## | 0.750 | 0.002 | 0.000 | 0.002 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 2 | 0 | 23 | 0 | 0 | 23 |
## | 17.250 | 323.438 | 3.623 | 0.690 | |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.058 |
## | 0.000 | 0.920 | 0.000 | 0.000 | |
## | 0.000 | 0.058 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 3 | 0 | 1 | 63 | 0 | 64 |
## | 48.000 | 2.250 | 277.830 | 1.920 | |
## | 0.000 | 0.016 | 0.984 | 0.000 | 0.160 |
## | 0.000 | 0.040 | 1.000 | 0.000 | |
## | 0.000 | 0.002 | 0.158 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 4 | 0 | 0 | 0 | 11 | 11 |
## | 8.250 | 0.688 | 1.732 | 344.997 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.028 |
## | 0.000 | 0.000 | 0.000 | 0.917 | |
## | 0.000 | 0.000 | 0.000 | 0.028 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 300 | 25 | 63 | 12 | 400 |
## | 0.750 | 0.062 | 0.158 | 0.030 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
ldaAngina <- lda(data=healthTrain, Angina~.)
ldaAnginaPred <- predict(ldaAngina,healthTest)
anginaCT <- CrossTable(ldaAnginaPred$class, healthTest$Angina)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Angina
## ldaAnginaPred$class | 1 | 2 | Row Total |
## --------------------|-----------|-----------|-----------|
## 1 | 59 | 15 | 74 |
## | 143.642 | 33.694 | |
## | 0.797 | 0.203 | 0.185 |
## | 0.776 | 0.046 | |
## | 0.147 | 0.037 | |
## --------------------|-----------|-----------|-----------|
## 2 | 17 | 309 | 326 |
## | 32.606 | 7.648 | |
## | 0.052 | 0.948 | 0.815 |
## | 0.224 | 0.954 | |
## | 0.042 | 0.772 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 76 | 324 | 400 |
## | 0.190 | 0.810 | |
## --------------------|-----------|-----------|-----------|
##
##
plot_ly(x = c("TP", "FP", "FN", "TN"),
y = c(anginaCT$prop.row[1,1], anginaCT$prop.row[1,2], anginaCT$prop.row[2,1], anginaCT$prop.row[2,2]),
name = c("TP", "FP", "FN", "TN"), type = "bar", color=c("TP", "FP", "FN", "TN")) %>%
layout(title="Confusion Matrix",
legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Row Proportion'))
ldaStroke <- lda(data=healthTrain, Stroke~.)
ldaStrokePred <- predict(ldaStroke,healthTest)
strokeCT <- CrossTable(ldaStrokePred$class, healthTest$Stroke)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Stroke
## ldaStrokePred$class | 1 | 2 | Row Total |
## --------------------|-----------|-----------|-----------|
## 1 | 34 | 6 | 40 |
## | 218.051 | 24.903 | |
## | 0.850 | 0.150 | 0.100 |
## | 0.829 | 0.017 | |
## | 0.085 | 0.015 | |
## --------------------|-----------|-----------|-----------|
## 2 | 7 | 353 | 360 |
## | 24.228 | 2.767 | |
## | 0.019 | 0.981 | 0.900 |
## | 0.171 | 0.983 | |
## | 0.018 | 0.882 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 41 | 359 | 400 |
## | 0.102 | 0.897 | |
## --------------------|-----------|-----------|-----------|
##
##
plot_ly(x = c("TP", "FP", "FN", "TN"),
y = c(strokeCT$prop.row[1,1], strokeCT$prop.row[1,2], strokeCT$prop.row[2,1], strokeCT$prop.row[2,2]),
name = c("TP", "FP", "FN", "TN"), type = "bar", color=c("TP", "FP", "FN", "TN")) %>%
layout(title="Confusion Matrix",
legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Row Proportion'))
ldaDiabetes <- lda(data=healthTrain, Diabetes~.)
ldaDiabetesPred <- predict(ldaDiabetes,healthTest)
diabetesCT <- CrossTable(ldaDiabetesPred$class, healthTest$Diabetes)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Diabetes
## ldaDiabetesPred$class | 1 | 2 | 3 | 4 | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 272 | 22 | 26 | 6 | 326 |
## | 3.093 | 0.130 | 12.511 | 1.461 | |
## | 0.834 | 0.067 | 0.080 | 0.018 | 0.815 |
## | 0.907 | 0.880 | 0.413 | 0.500 | |
## | 0.680 | 0.055 | 0.065 | 0.015 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 2 | 0 | 1 | 0 | 0 | 1 |
## | 0.750 | 14.062 | 0.158 | 0.030 | |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.002 |
## | 0.000 | 0.040 | 0.000 | 0.000 | |
## | 0.000 | 0.002 | 0.000 | 0.000 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 3 | 26 | 2 | 37 | 6 | 71 |
## | 13.945 | 1.339 | 59.606 | 7.031 | |
## | 0.366 | 0.028 | 0.521 | 0.085 | 0.177 |
## | 0.087 | 0.080 | 0.587 | 0.500 | |
## | 0.065 | 0.005 | 0.092 | 0.015 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 4 | 2 | 0 | 0 | 0 | 2 |
## | 0.167 | 0.125 | 0.315 | 0.060 | |
## | 1.000 | 0.000 | 0.000 | 0.000 | 0.005 |
## | 0.007 | 0.000 | 0.000 | 0.000 | |
## | 0.005 | 0.000 | 0.000 | 0.000 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 300 | 25 | 63 | 12 | 400 |
## | 0.750 | 0.062 | 0.158 | 0.030 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##
##
DecisionTreeAngina <- C5.0(healthTrain[,-9], healthTrain$Angina)
DTAPred <- predict(DecisionTreeAngina,healthTest[,-9])
confusionMatrix(table(DTAPred, healthTest$Angina))
## Confusion Matrix and Statistics
##
##
## DTAPred 1 2
## 1 60 18
## 2 16 306
##
## Accuracy : 0.915
## 95% CI : (0.8832, 0.9404)
## No Information Rate : 0.81
## P-Value [Acc > NIR] : 3.706e-09
##
## Kappa : 0.7266
##
## Mcnemar's Test P-Value : 0.8638
##
## Sensitivity : 0.7895
## Specificity : 0.9444
## Pos Pred Value : 0.7692
## Neg Pred Value : 0.9503
## Prevalence : 0.1900
## Detection Rate : 0.1500
## Detection Prevalence : 0.1950
## Balanced Accuracy : 0.8670
##
## 'Positive' Class : 1
##
DecisionTreeStroke <- C5.0(healthTrain[,-10], healthTrain$Stroke)
DTSPred <- predict(DecisionTreeStroke,healthTest[,-10])
confusionMatrix(table(DTSPred, healthTest$Stroke))
## Confusion Matrix and Statistics
##
##
## DTSPred 1 2
## 1 34 8
## 2 7 351
##
## Accuracy : 0.9625
## 95% CI : (0.9389, 0.9789)
## No Information Rate : 0.8975
## P-Value [Acc > NIR] : 1.126e-06
##
## Kappa : 0.7984
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8293
## Specificity : 0.9777
## Pos Pred Value : 0.8095
## Neg Pred Value : 0.9804
## Prevalence : 0.1025
## Detection Rate : 0.0850
## Detection Prevalence : 0.1050
## Balanced Accuracy : 0.9035
##
## 'Positive' Class : 1
##
DecisionTreeDiabetes <- C5.0(healthTrain[,-11], healthTrain$Diabetes)
DTDPred <- predict(DecisionTreeDiabetes, healthTest[,-11])
confusionMatrix(table(DTDPred, healthTest$Diabetes))
## Confusion Matrix and Statistics
##
##
## DTDPred 1 2 3 4
## 1 271 20 32 5
## 2 0 0 0 0
## 3 29 5 31 7
## 4 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.755
## 95% CI : (0.7098, 0.7964)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.4349
##
## Kappa : 0.3131
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.9033 0.0000 0.4921 0.00
## Specificity 0.4300 1.0000 0.8783 1.00
## Pos Pred Value 0.8262 NaN 0.4306 NaN
## Neg Pred Value 0.5972 0.9375 0.9024 0.97
## Prevalence 0.7500 0.0625 0.1575 0.03
## Detection Rate 0.6775 0.0000 0.0775 0.00
## Detection Prevalence 0.8200 0.0000 0.1800 0.00
## Balanced Accuracy 0.6667 0.5000 0.6852 0.50
Naive Bayes and LDA recover most of the angina and stroke cases with high true-positive and true-negative rates, and the decision trees reach accuracies above 91% for those two outcomes. Diabetes is harder: the decision tree's accuracy (0.755) barely exceeds the no-information rate (0.750).
## Logistic Regressions
anginaLM <- glm(Angina ~ ., data = health, family = "binomial")
summary(anginaLM)
##
## Call:
## glm(formula = Angina ~ ., family = "binomial", data = health)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4759 0.0000 0.0000 0.0462 3.4413
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.679e+01 1.318e+04 0.001 0.998983
## Age2 6.230e+00 1.263e+00 4.931 8.16e-07 ***
## Age3 9.018e+00 1.481e+00 6.087 1.15e-09 ***
## Age4 1.199e+01 1.666e+00 7.195 6.25e-13 ***
## Age5 1.622e+01 2.141e+00 7.573 3.63e-14 ***
## Age6 4.158e+01 2.363e+03 0.018 0.985963
## Sex2 -1.829e+00 5.013e-01 -3.649 0.000263 ***
## Race2 -9.829e-01 9.167e-01 -1.072 0.283609
## Race3 -8.209e-01 1.192e+00 -0.689 0.491064
## Race4 6.718e+00 6.856e+00 0.980 0.327136
## Race5 -2.352e+00 2.023e+00 -1.162 0.245042
## Race9 -6.370e-02 1.275e+00 -0.050 0.960150
## School3 -1.737e+01 1.318e+04 -0.001 0.998948
## School4 -2.082e+01 1.318e+04 -0.002 0.998739
## School5 -2.321e+01 1.318e+04 -0.002 0.998595
## School6 -2.457e+01 1.318e+04 -0.002 0.998512
## Marriage2 -2.785e-01 5.513e-01 -0.505 0.613454
## Marriage3 3.870e-01 6.845e-01 0.565 0.571852
## Marriage4 -1.546e+01 1.601e+03 -0.010 0.992298
## Marriage5 3.313e-01 6.096e-01 0.543 0.586829
## Marriage6 2.449e-02 9.976e-01 0.025 0.980418
## Employment2 2.813e+00 2.500e+00 1.125 0.260580
## Employment3 -1.959e+00 1.462e+00 -1.340 0.180261
## Employment4 -3.354e+00 1.176e+00 -2.851 0.004355 **
## Employment5 -7.093e-01 1.115e+00 -0.636 0.524676
## Employment6 -1.641e+00 1.697e+00 -0.967 0.333564
## Employment7 -9.317e-01 6.465e-01 -1.441 0.149553
## Employment8 -1.028e+00 5.656e-01 -1.817 0.069238 .
## Income2 -2.436e-01 8.518e-01 -0.286 0.774941
## Income3 1.170e-04 9.685e-01 0.000 0.999904
## Income4 -6.148e-01 7.638e-01 -0.805 0.420811
## Income5 4.917e-01 6.980e-01 0.704 0.481196
## Income9 1.396e+00 8.940e-01 1.561 0.118455
## HeartAttack2 -2.684e+00 1.753e+00 -1.531 0.125794
## Stroke2 6.602e-01 9.955e-01 0.663 0.507247
## Diabetes2 -1.168e+00 7.027e-01 -1.661 0.096628 .
## Diabetes3 1.436e+01 2.447e+03 0.006 0.995319
## Diabetes4 1.155e+01 6.140e+03 0.002 0.998499
## Smoking2 4.874e+00 9.731e-01 5.009 5.48e-07 ***
## Smoking9 2.076e+00 1.333e+00 1.558 0.119241
## Alcohol2 4.398e+00 1.745e+00 2.520 0.011745 *
## Alcohol9 1.495e+00 1.291e+00 1.158 0.246773
## Fruits2 1.612e+00 4.979e-01 3.238 0.001205 **
## Fruits9 -1.226e+00 1.031e+00 -1.189 0.234391
## Vegetables2 2.313e+00 6.204e-01 3.728 0.000193 ***
## Vegetables9 -4.775e-01 7.541e-01 -0.633 0.526553
## Leisure2 2.131e+00 6.307e-01 3.379 0.000728 ***
## Leisure9 -2.230e+00 1.373e+00 -1.624 0.104401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 963.69 on 999 degrees of freedom
## Residual deviance: 188.18 on 952 degrees of freedom
## AIC: 284.18
##
## Number of Fisher Scoring iterations: 21
x <- summary(anginaLM)$coefficients[,4] < .05
anginaLMSig <- rownames(summary(anginaLM)$coefficients)[x] # save the significant variables for later comparison with the other feature selection methods
strokeLM <- glm(Stroke ~ ., data = health, family = "binomial")
summary(strokeLM)
##
## Call:
## glm(formula = Stroke ~ ., family = "binomial", data = health)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.701 0.000 0.000 0.000 2.397
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.575e+00 2.341e+04 0.000 0.999776
## Age2 6.692e+00 1.365e+00 4.903 9.42e-07 ***
## Age3 1.482e+01 3.009e+00 4.927 8.37e-07 ***
## Age4 3.289e+01 4.074e+03 0.008 0.993559
## Age5 3.551e+01 3.309e+03 0.011 0.991439
## Age6 3.436e+01 2.757e+03 0.012 0.990057
## Sex2 3.718e-01 7.262e-01 0.512 0.608654
## Race2 -9.632e-02 2.671e+00 -0.036 0.971234
## Race3 6.705e+00 2.715e+00 2.470 0.013516 *
## Race4 2.212e-01 5.137e+01 0.004 0.996564
## Race5 1.586e+00 3.227e+00 0.491 0.623205
## Race9 -2.058e+00 1.565e+00 -1.315 0.188573
## School3 -6.445e+00 2.341e+04 0.000 0.999780
## School4 -1.063e+01 2.341e+04 0.000 0.999638
## School5 -1.240e+01 2.341e+04 -0.001 0.999577
## School6 -1.665e+01 2.341e+04 -0.001 0.999433
## Marriage2 4.557e-02 1.047e+00 0.044 0.965298
## Marriage3 -5.448e-01 9.836e-01 -0.554 0.579645
## Marriage4 3.147e+00 8.088e+00 0.389 0.697185
## Marriage5 -2.088e+00 1.072e+00 -1.948 0.051375 .
## Marriage6 3.642e+00 2.224e+00 1.638 0.101523
## Employment2 2.135e-01 5.747e+00 0.037 0.970358
## Employment3 6.420e+00 2.168e+00 2.961 0.003069 **
## Employment4 -6.157e+00 2.513e+00 -2.450 0.014279 *
## Employment5 6.330e+00 1.988e+00 3.184 0.001451 **
## Employment6 1.814e+01 1.565e+04 0.001 0.999075
## Employment7 2.207e+00 1.160e+00 1.903 0.056992 .
## Employment8 2.405e+00 9.568e-01 2.514 0.011938 *
## Income2 -4.978e+00 1.847e+00 -2.695 0.007041 **
## Income3 -3.992e+00 1.894e+00 -2.107 0.035098 *
## Income4 -5.276e+00 1.860e+00 -2.836 0.004561 **
## Income5 -4.874e+00 1.648e+00 -2.958 0.003093 **
## Income9 -2.482e+00 1.597e+00 -1.554 0.120079
## HeartAttack2 1.979e+00 1.932e+00 1.025 0.305522
## Angina2 4.649e-01 1.282e+00 0.362 0.716981
## Diabetes2 -1.031e+00 1.314e+00 -0.785 0.432664
## Diabetes3 1.084e+01 3.299e+03 0.003 0.997379
## Diabetes4 -1.419e+00 1.168e+04 0.000 0.999903
## Smoking2 5.190e+00 1.500e+00 3.460 0.000541 ***
## Smoking9 -3.707e+00 1.965e+00 -1.887 0.059168 .
## Alcohol2 -1.583e+00 1.804e+00 -0.877 0.380374
## Alcohol9 2.200e-01 6.036e+00 0.036 0.970929
## Fruits2 1.399e+00 7.798e-01 1.794 0.072844 .
## Fruits9 -4.617e-01 1.680e+00 -0.275 0.783448
## Vegetables2 3.971e+00 1.116e+00 3.557 0.000375 ***
## Vegetables9 -1.493e+00 1.263e+00 -1.182 0.237295
## Leisure2 3.652e+00 1.050e+00 3.478 0.000504 ***
## Leisure9 -1.794e+00 1.808e+00 -0.992 0.321001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 627.912 on 999 degrees of freedom
## Residual deviance: 88.892 on 952 degrees of freedom
## AIC: 184.89
##
## Number of Fisher Scoring iterations: 22
x <- summary(strokeLM)$coefficients[,4] < .05
strokeLMSig <- rownames(summary(strokeLM)$coefficients)[x] # save the significant variables for later comparison with the other feature selection methods
# Multinomial logistic regression
diabetesLM <- multinom(Diabetes ~ ., data = health) # stops at nnet's default maxit = 100; raising maxit would allow further convergence
## # weights: 188 (138 variable)
## initial value 1386.294361
## iter 10 value 485.423916
## iter 20 value 400.229607
## iter 30 value 391.035328
## iter 40 value 389.248554
## iter 50 value 388.918363
## iter 60 value 388.755521
## iter 70 value 388.588214
## iter 80 value 388.554762
## iter 90 value 388.456470
## iter 100 value 388.446908
## final value 388.446908
## stopped after 100 iterations
summary(diabetesLM)
## Call:
## multinom(formula = Diabetes ~ ., data = health)
##
## Coefficients:
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 -23.845640 0.228492 0.1890488 1.102002 -0.1193996 0.1000789 12.2415970
## 3 -10.123903 7.526140 8.3406602 10.293995 11.9331364 13.3489896 -0.8916207
## 4 -6.205156 -4.471158 8.2636577 10.222035 10.7712852 12.8469054 -0.7550098
## Race2 Race3 Race4 Race5 Race9 School3
## 2 1.3270821 -10.911943 -10.1172718 -10.23349935 2.448592 -13.897770
## 3 0.3640057 -2.268112 0.1320356 -0.01904884 2.338475 -4.437903
## 4 -12.3052437 -17.965338 -7.2198419 -9.19831731 4.208112 -8.111618
## School4 School5 School6 Marriage2 Marriage3 Marriage4
## 2 0.5390254 0.6464011 0.8929987 -0.6486809 -1.623709531 0.18674280
## 3 -6.2207739 -8.9783005 -11.6298273 -0.1458774 -0.001513464 -0.07101547
## 4 -8.6991456 -10.5154197 -14.3019830 -0.8386655 -0.452226177 -13.82239802
## Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 -0.54514462 -1.33242728 0.010669566 0.3828141 -10.066596 0.3708215
## 3 -0.94079389 -0.07848923 -0.003073865 -1.1689414 1.254144 0.6417032
## 4 0.01474081 -10.15974267 -10.079925026 1.6887563 -9.293229 0.4854404
## Employment6 Employment7 Employment8 Income2 Income3 Income4
## 2 1.7835135 -0.3997252 0.6013019 -0.40677611 -2.4787821 0.1220308
## 3 0.5621604 0.1089195 -0.4876101 -0.04393232 0.5190654 -0.4356655
## 4 -9.4477142 0.5840835 -0.4333128 0.76629283 1.2107869 0.6719532
## Income5 Income9 HeartAttack2 Angina2 Stroke2 Smoking2
## 2 -0.4871224 -0.05776236 7.0416523 -0.7617192 1.8094802 0.2649059
## 3 -0.3083258 0.34726289 -0.7082079 3.5307840 0.2570182 3.2793888
## 4 1.4394547 1.63426503 -3.0432397 4.7325083 -2.4347133 3.5799597
## Smoking9 Alcohol2 Alcohol9 Fruits2 Fruits9 Vegetables2 Vegetables9
## 2 0.3495308 -0.7915249 -1.3527356 0.1618617 1.0243905 0.7599804 0.8458477
## 3 0.2566212 2.4709877 0.9318741 1.4147436 0.4395236 1.1527861 0.7630708
## 4 -10.2700939 2.3654527 -9.7233765 1.2501587 0.7499800 0.9894764 -10.5734441
## Leisure2 Leisure9
## 2 0.7770696 -0.8974989
## 3 2.4250543 -0.5097800
## 4 1.4348926 -0.2750364
##
## Std. Errors:
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 111.12640 1.107685 1.169325 1.208408 1.273265 1.268669 64.6700930
## 3 96.20731 92.981636 92.980287 92.979495 92.979599 92.979702 0.2752929
## 4 127.19162 74.372250 123.056849 123.063549 123.064328 123.064456 0.5258059
## Race2 Race3 Race4 Race5 Race9 School3 School4
## 2 0.5264059 1.642437e+02 180.365712 173.269557 1.034636 0.2920375 47.893981
## 3 0.5068182 1.222413e+00 1.049532 1.231685 1.105757 2.6216619 2.591366
## 4 176.6119494 9.528947e-05 145.206562 206.141227 1.265192 2.9627181 2.815176
## School5 School6 Marriage2 Marriage3 Marriage4 Marriage5 Marriage6
## 2 47.894122 47.893165 0.5516379 0.7974547 1.3808488802 0.5229251 1.1805731
## 3 2.634921 2.696002 0.4230981 0.3987608 1.5308131868 0.4095321 0.7007587
## 4 2.870052 3.194266 0.9117707 0.8174344 0.0002699382 0.6715403 144.7473168
## Employment2 Employment3 Employment4 Employment5 Employment6 Employment7
## 2 0.9569608 1.197182 193.653144 0.9093629 1.0911126 0.5920207
## 3 0.6754031 1.328496 1.041996 0.7145045 0.8090641 0.3882478
## 4 157.8802172 1.577241 233.680212 1.2734154 205.8712207 0.6650961
## Employment8 Income2 Income3 Income4 Income5 Income9 HeartAttack2
## 2 0.4538096 0.6316236 1.1421688 0.6147420 0.5590656 0.6638731 76.63196
## 3 0.3373228 0.5470485 0.5594095 0.5348256 0.4963819 0.5807680 18.70123
## 4 0.6326563 1.3110992 1.4828685 1.3040046 1.2511020 1.3597623 17.49951
## Angina2 Stroke2 Smoking2 Smoking9 Alcohol2 Alcohol9 Fruits2
## 2 0.5709167 1.086667 0.5347810 0.9680688 1.1686414 1.3473229 0.4018891
## 3 15.5643487 13.845363 0.3998010 0.9646119 0.4967988 0.8098973 0.2990350
## 4 63.8326779 36.321014 0.6515794 176.5866845 0.9322985 169.0603771 0.5310582
## Fruits9 Vegetables2 Vegetables9 Leisure2 Leisure9
## 2 0.5891724 0.4246063 0.6591481 0.4300163 1.1801469
## 3 0.6486549 0.3050981 0.5492398 0.3361840 0.6796632
## 4 1.2982978 0.5659303 172.8184802 0.6016299 1.2437793
##
## Residual Deviance: 776.8938
## AIC: 1052.894
z <- summary(diabetesLM)$coefficients/summary(diabetesLM)$standard.errors #Getting z-scores
p <- (1 - pnorm(abs(z),0,1))*2 ## Getting the p-values
p
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 0.8300938 0.8365730 0.8715631 0.3617975 0.9252884 0.9371241 0.849863168
## 3 0.9161932 0.9354879 0.9285228 0.9118443 0.8978788 0.8858409 0.001200309
## 4 0.9610899 0.9520612 0.9464598 0.9338014 0.9302537 0.9168585 0.151028005
## Race2 Race3 Race4 Race5 Race9 School3
## 2 0.01170138 0.94702951 0.9552676 0.9529034 0.0179512756 0.000000000
## 3 0.47262325 0.06353354 0.8998868 0.9876607 0.0344455047 0.090496934
## 4 0.94445321 0.00000000 0.9603446 0.9644091 0.0008808087 0.006183338
## School4 School5 School6 Marriage2 Marriage3 Marriage4
## 2 0.991020355 0.9892317103 9.851238e-01 0.2396277 0.04173881 0.8924240
## 3 0.016369252 0.0006557862 1.605218e-05 0.7302569 0.99697170 0.9629989
## 4 0.002000961 0.0002484598 7.556029e-06 0.3576665 0.58010847 0.0000000
## Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 0.29718413 0.2590550 0.9911042 0.7491483 0.9585426 0.6834340
## 3 0.02160503 0.9108186 0.9963687 0.3789144 0.2287452 0.3691272
## 4 0.98248725 0.9440428 0.9490933 0.2843032 0.9682773 0.7030464
## Employment6 Employment7 Employment8 Income2 Income3 Income4 Income5
## 2 0.1021366 0.4995564 0.1851681 0.5195646 0.02998857 0.8426481 0.3835821
## 3 0.4871631 0.7790623 0.1483092 0.9359924 0.35346933 0.4153050 0.5345033
## 4 0.9633968 0.3798384 0.4934006 0.5589069 0.41420468 0.6063436 0.2499176
## Income9 HeartAttack2 Angina2 Stroke2 Smoking2 Smoking9
## 2 0.9306650 0.9267860 0.1821371 0.09587978 6.203503e-01 0.7180547
## 3 0.5498817 0.9697917 0.8205398 0.98518933 2.220446e-16 0.7902117
## 4 0.2294117 0.8619407 0.9408996 0.94655528 3.923081e-08 0.9536220
## Alcohol2 Alcohol9 Fruits2 Fruits9 Vegetables2 Vegetables9
## 2 4.982134e-01 0.3153702 6.871306e-01 0.08208862 0.0734784985 0.1994067
## 3 6.564649e-07 0.2498937 2.233832e-06 0.49803016 0.0001578317 0.1647350
## 4 1.117346e-02 0.9541356 1.856812e-02 0.56349094 0.0803935932 0.9512140
## Leisure2 Leisure9
## 2 7.075133e-02 0.4469572
## 3 5.453415e-13 0.4532258
## 4 1.707871e-02 0.8249915
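Rather than scanning the p-value matrix by eye, the significant (class, coefficient) pairs can be extracted directly; a sketch operating on the `p` matrix computed above:

```r
# indices of cells below alpha = 0.05
sig <- which(p < 0.05, arr.ind = TRUE)
# assemble a readable table: which diabetes class, which term, and the p-value
data.frame(class   = rownames(p)[sig[, "row"]],
           term    = colnames(p)[sig[, "col"]],
           p.value = p[sig])
```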
## Feature Selection
BorutaAngina <- Boruta(Angina ~., data = health, doTrace=0)
BorutaAngina
## Boruta performed 99 iterations in 5.156267 secs.
## 8 attributes confirmed important: Age, Alcohol, Diabetes, Leisure,
## School and 3 more;
## 6 attributes confirmed unimportant: Employment, HeartAttack, Income,
## Marriage, Race and 1 more;
## 1 tentative attributes left: Fruits;
df_long <- tidyr::gather(as.data.frame(BorutaAngina$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all Features",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
showlegend=F)
BorutaAnginaVars <- getSelectedAttributes(BorutaAngina)
control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainAngina <- rfe(healthTrain[,-9], healthTrain[,9],sizes=c(10, 15), rfeControl=control)
rf.trainAngina
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 10 0.9300 0.7467 0.02212 0.08561 *
## 15 0.9299 0.7413 0.01572 0.06246
##
## The top 5 variables (out of 10):
## Age, Stroke, Smoking, Sex, School
rfAnginaVars <- predictors(rf.trainAngina)
BorutaAnginaVars
## [1] "Age" "Sex" "School" "Stroke" "Diabetes" "Smoking" "Alcohol"
## [8] "Leisure"
rfAnginaVars
## [1] "Age" "Stroke" "Smoking" "Sex" "School"
## [6] "Diabetes" "Leisure" "Alcohol" "Income" "Vegetables"
anginaLMSig
## [1] "Age2" "Age3" "Age4" "Age5" "Sex2"
## [6] "Employment4" "Smoking2" "Alcohol2" "Fruits2" "Vegetables2"
## [11] "Leisure2"
anginaOverlap <- c("Age","Sex","Smoking","Alcohol","Leisure") # for predicting angina, these are the features common to all of the feature selection methods used
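The overlap can also be computed rather than listed by hand. The logistic-regression names carry factor-level suffixes (Age2, Smoking2, ...), so a sketch first strips those back to variable names (the gsub pattern assumes the levels are purely numeric suffixes, which holds for this data):

```r
# map coefficient names such as "Age2" back to the underlying variable "Age"
lmVars <- unique(gsub("[0-9]+$", "", anginaLMSig))
# features selected by Boruta, RFE, and the logistic regression alike
Reduce(intersect, list(BorutaAnginaVars, rfAnginaVars, lmVars))
```

The same two lines apply unchanged to the stroke and diabetes lists.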
BorutaStroke <- Boruta(Stroke ~., data = health, doTrace=0)
BorutaStroke
## Boruta performed 99 iterations in 3.536886 secs.
## 6 attributes confirmed important: Age, Angina, Diabetes, Income,
## School and 1 more;
## 7 attributes confirmed unimportant: Alcohol, Employment, Fruits,
## HeartAttack, Marriage and 2 more;
## 2 tentative attributes left: Leisure, Smoking;
df_long <- tidyr::gather(as.data.frame(BorutaStroke$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all Features",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
showlegend=F)
BorutaStrokeVars <- getSelectedAttributes(BorutaStroke)
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rf.trainStroke <- rfe(healthTrain[,-10], healthTrain[,10], sizes = c(10, 15), rfeControl = control)
rf.trainStroke
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 10 0.9552 0.7179 0.02576 0.1691 *
## 15 0.9535 0.7013 0.02159 0.1433
##
## The top 5 variables (out of 10):
## Age, Angina, Sex, Diabetes, School
rfStrokeVars <- predictors(rf.trainStroke)
BorutaStrokeVars
## [1] "Age" "Sex" "School" "Income" "Angina" "Diabetes"
rfStrokeVars
## [1] "Age" "Angina" "Sex" "Diabetes" "School"
## [6] "Leisure" "Income" "Smoking" "Fruits" "HeartAttack"
strokeLMSig
## [1] "Age2" "Age3" "Race3" "Employment3" "Employment4"
## [6] "Employment5" "Employment8" "Income2" "Income3" "Income4"
## [11] "Income5" "Smoking2" "Vegetables2" "Leisure2"
strokeOverlap <- c("Age","Income") # Features selected by all three feature selection methods for predicting stroke.
BorutaDiabetes <- Boruta(Diabetes ~., data = health, doTrace=0)
BorutaDiabetes
## Boruta performed 99 iterations in 7.435833 secs.
## 8 attributes confirmed important: Age, Alcohol, Angina, Fruits,
## Leisure and 3 more;
## 5 attributes confirmed unimportant: Employment, HeartAttack, Income,
## Marriage, Race;
## 2 tentative attributes left: Sex, Vegetables;
df_long <- tidyr::gather(as.data.frame(BorutaDiabetes$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all Features",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
showlegend=F)
BorutaDiabetesVars <- getSelectedAttributes(BorutaDiabetes)
control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
rf.trainDiabetes <- rfe(healthTrain[,-11], healthTrain[,11], sizes = c(10, 15), rfeControl = control)
rf.trainDiabetes
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 10 0.8251 0.5074 0.02563 0.08916 *
## 15 0.8152 0.4213 0.02857 0.10381
##
## The top 5 variables (out of 10):
## School, Age, Smoking, Angina, Leisure
rfDiabetesVars <- predictors(rf.trainDiabetes)
BorutaDiabetesVars
## [1] "Age" "School" "Angina" "Stroke" "Smoking" "Alcohol" "Fruits"
## [8] "Leisure"
rfDiabetesVars
## [1] "School" "Age" "Smoking" "Angina" "Leisure"
## [6] "Stroke" "Sex" "Race" "Vegetables" "Fruits"
p
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 0.8300938 0.8365730 0.8715631 0.3617975 0.9252884 0.9371241 0.849863168
## 3 0.9161932 0.9354879 0.9285228 0.9118443 0.8978788 0.8858409 0.001200309
## 4 0.9610899 0.9520612 0.9464598 0.9338014 0.9302537 0.9168585 0.151028005
## Race2 Race3 Race4 Race5 Race9 School3
## 2 0.01170138 0.94702951 0.9552676 0.9529034 0.0179512756 0.000000000
## 3 0.47262325 0.06353354 0.8998868 0.9876607 0.0344455047 0.090496934
## 4 0.94445321 0.00000000 0.9603446 0.9644091 0.0008808087 0.006183338
## School4 School5 School6 Marriage2 Marriage3 Marriage4
## 2 0.991020355 0.9892317103 9.851238e-01 0.2396277 0.04173881 0.8924240
## 3 0.016369252 0.0006557862 1.605218e-05 0.7302569 0.99697170 0.9629989
## 4 0.002000961 0.0002484598 7.556029e-06 0.3576665 0.58010847 0.0000000
## Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 0.29718413 0.2590550 0.9911042 0.7491483 0.9585426 0.6834340
## 3 0.02160503 0.9108186 0.9963687 0.3789144 0.2287452 0.3691272
## 4 0.98248725 0.9440428 0.9490933 0.2843032 0.9682773 0.7030464
## Employment6 Employment7 Employment8 Income2 Income3 Income4 Income5
## 2 0.1021366 0.4995564 0.1851681 0.5195646 0.02998857 0.8426481 0.3835821
## 3 0.4871631 0.7790623 0.1483092 0.9359924 0.35346933 0.4153050 0.5345033
## 4 0.9633968 0.3798384 0.4934006 0.5589069 0.41420468 0.6063436 0.2499176
## Income9 HeartAttack2 Angina2 Stroke2 Smoking2 Smoking9
## 2 0.9306650 0.9267860 0.1821371 0.09587978 6.203503e-01 0.7180547
## 3 0.5498817 0.9697917 0.8205398 0.98518933 2.220446e-16 0.7902117
## 4 0.2294117 0.8619407 0.9408996 0.94655528 3.923081e-08 0.9536220
## Alcohol2 Alcohol9 Fruits2 Fruits9 Vegetables2 Vegetables9
## 2 4.982134e-01 0.3153702 6.871306e-01 0.08208862 0.0734784985 0.1994067
## 3 6.564649e-07 0.2498937 2.233832e-06 0.49803016 0.0001578317 0.1647350
## 4 1.117346e-02 0.9541356 1.856812e-02 0.56349094 0.0803935932 0.9512140
## Leisure2 Leisure9
## 2 7.075133e-02 0.4469572
## 3 5.453415e-13 0.4532258
## 4 1.707871e-02 0.8249915
diabetesOverlap <- c("School","Smoking","Alcohol","Leisure") # Features selected by all three feature selection methods for predicting diabetes.
Each of the classification methods obtained high accuracy in predicting whether a given patient had one of the chronic conditions. However, the cohorts were unbalanced, with healthy patients far outnumbering those with a chronic condition, so future work should attempt to collect more balanced cohorts. The demographic and behavioral variables important for predicting angina are age, smoking, alcohol, and leisure; for stroke, age and income; and for diabetes, education, smoking, alcohol, and leisure. These variables represent the overlap between the three feature selection methods, but it might be reasonable to also include variables that overlap between only two of the methods. For example, logistic regression and RFE identified smoking as an important predictor of stroke, but the random forest algorithm did not. Future work would also benefit from including more behaviors, both protective and harmful with respect to the given conditions.
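Beyond collecting more balanced cohorts, the class imbalance could also be addressed at training time by down-sampling the majority class. A minimal base-R sketch, using a hypothetical toy data frame in place of the real health data (which is not reproduced here); the `caret` package loaded above offers the same operation via `downSample()`.

```r
set.seed(42)
# Toy stand-in for the health data: 900 healthy patients, 100 with angina
toy <- data.frame(Age = sample(1:6, 1000, replace = TRUE),
                  Angina = factor(rep(c("No", "Yes"), times = c(900, 100))))

# Down-sample every class to the size of the smallest class
n_min <- min(table(toy$Angina))
balanced <- do.call(rbind, lapply(split(toy, toy$Angina),
                                  function(d) d[sample(nrow(d), n_min), ]))
table(balanced$Angina)
##  No Yes
## 100 100
```

Training the classifiers on such a balanced set would make the reported accuracies harder to achieve by simply predicting the majority class.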
Thank you to Dr. Dinov for providing the data-set.
Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics—2021 update: a report from the American Heart Association. Circulation. 2021;143:e254–e743.